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Abstract. This issue of Statistical Science draws its inspiration from 
the work of James M. Robins. Jon Wellner, the Editor at the time, 
asked the two of us to edit a special issue that would highlight the 
research topics studied by Robins and the breadth and depth of Robins’ 
contributions. Between the two of us, we have collaborated closely with 
Jamie for nearly 40 years. We agreed to edit this issue because we 
recognized that we were among the few in a position to relate the 
trajectory of his research career to date. * 1 


Many readers may be unfamiliar with Robins’ sin¬ 
gular career trajectory and in particular how his 
early practical experience motivated many of the in¬ 
ferential problems with which he was subsequently 
involved. Robins majored in mathematics at Har¬ 
vard College, but then, in the spirit of the times, left 
college to pursue more activist social and political 
goals. Several years later, Robins enrolled in Medical 
School at Washington University in St. Louis, grad¬ 
uating in 1976. His M.D. degree remains his only 
degree, other than his high school diploma. 
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i Here, we restrict attention to Robins’ contributions to the 
research literature. Robins has also contributed by training 
and mentoring leading researchers in causal inference: among 
others, Elizabeth Halloran, Miguel Hernan, Eric Tchetgen 
Tchetgen and Tyler VanderWeele worked with him as grad¬ 
uate students. Both of the editors of this Special Issue were 
greatly influenced by Robins’ as a graduate student and post¬ 
doc and have been fortunate to have subsequently collabo¬ 
rated with him over many years. 


After graduating, he interned in medicine at 
Harlem Hospital in New York. After completing the 
internship, Robins spent a year working as a pri¬ 
mary care physician in a community clinic in the 
Roxbury neighborhood of Boston. During that year, 
he helped organize a vertical Service Employees In¬ 
ternational Union affiliate that included all salaried 
personnel, from maintenance to physicians, working 
at the health center. In retaliation, he was dismissed 
by the director of the clinic and found that he was 
somewhat unwelcome at the other Boston commu¬ 
nity clinics. Unable to find a job and with his un¬ 
employment insurance running out, he surprisingly 
was able to obtain a prestigious residency in Inter¬ 
nal Medicine at Yale University, a testament, he says 
with some irony, to the enduring strength of one’s 
Ivy League connections. 

At Yale, Robins and his college friend Mark 
Cullen, now head of General Medicine at Stan¬ 
ford Medical School, founded an occupational health 
clinic, with the goal of working with trade unions 
in promoting occupational health and safety. When 
testifying in workers’ compensation cases, Robins 
was regularly asked whether it was “more probable 
than not that a worker’s death or illness was caused 
by exposure to chemicals in the workplace.” Robins’ 
lifelong interest in causal inference began with his 
need to provide an answer. As the relevant scientific 
papers consisted of epidemiologic studies and biosta- 
tistical analyses, Robins enrolled in biostatistics and 
epidemiology classes at Yale. He was dismayed to 
learn that the one question he needed to answer was 
the one question excluded from formal discussion 
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in the mainstream biostatistical literature. 2 At the 
time, most biostatisticians insisted that evidence for 
causation could only be obtained through random¬ 
ized controlled trials; since, for ethical reasons, po¬ 
tentially harmful chemicals could not be randomly 
assigned, it followed that statistics could play little 
role in disentangling causation from spurious corre¬ 
lation. 

1. CONFOUNDING 

In his classes, Robins was struck by the gap 
present between the informal, yet insightful, lan¬ 
guage of epidemiologists such as Miettinen and Cook 
(1981) expressed in terms of “confounding, compa¬ 
rability, and bias,” and the technical language of 
mathematical statistics in which these terms either 
did not have analogs or had other meanings. Robins’ 
first major paper “The foundations of confounding 
in Epidemiology” written in 1982, though only pub¬ 
lished in 1987, was an attempt to bridge this gap. As 
one example, he offered a precise mathematical def¬ 
inition for the informal epidemiologic concept of a 
“confounding variable” that has apparently stood 
the test of time (see VanderWeele and Shpitser, 
2013). As a second example, Efron and Hinkley 
(1978) had formally considered inference accurate to 
order n~ 3 / 2 in variance conditional on exact or ap¬ 
proximate ancillary statistics. Robins showed, sur¬ 
prisingly, that long before their paper, epidemiolo¬ 
gists had been intuitively and informally referring to 
an estimator as “unbiased” just when it was asymp¬ 
totically unbiased conditional on either exact or ap¬ 
proximate ancillary statistics; furthermore, they in¬ 
tuitively required that the associated conditional 
Wald confidence interval be accurate to 0(n -3 / 2 ) in 
variance. As a third example, he solved the prob¬ 
lem of constructing the tightest Wald-type intervals 


2 Robins and Greenland (1989a, 1989b) provided a formal 
definition of the probability of causation and a definitive an¬ 
swer to the question in the following sense. They proved that 
the probability of causation was not identified from epidemi¬ 
ologic data even in the absence of confounding, but that 
sharp upper and lower bounds could be obtained. Specifi¬ 
cally, under the assumption that a workplace exposure was 
never beneficial, the probability P(t) that a workers death 
occurring t years after exposure was due to that exposure 
was sharply upper bounded by 1 and lower bounded by 
max[0,{/i(f) - fo(t)}/fi(t)\, where /i(t) and f 0 (t) are, re¬ 
spectively, the marginal densities in the exposed and unex¬ 
posed cohorts of the random variable T encoding time to 
death. 


guaranteed to have conservative coverage for the av¬ 
erage causal effect among the n study subjects par¬ 
ticipating in a completely randomized experiment 
with a binary response variable; he showed that this 
interval can be strictly narrower than the usual bino¬ 
mial interval even under the Neyman null hypothesis 
of no average causal effect. To do so, he constructed 
an estimator of the variance of the empirical differ¬ 
ence in treatment means that improved on a vari¬ 
ance estimator earlier proposed by Neyman (1923). 
Aronow, Green and Lee (2014) have recently gener¬ 
alized this result in several directions including to 
nonbinary responses. 

2. TIME-DEPENDENT CONFOUNDING AND 
THE g-FORMULA 

It was also in 1982 that Robins turned his atten¬ 
tion to the subject that would become his grail: 
causal inference from complex longitudinal data 
with time-varying treatments, that eventually cul¬ 
minated in his revolutionary papers Robins (1986, 
1987b). His interest in this topic was sparked by (i) a 
paper of Gilbert (1982) 3 on the healthy worker sur¬ 
vivor effect in occupational epidemiology, wherein 
the author raised a number of questions Robins an¬ 
swered in these papers and (ii) his medical experi¬ 
ence of trying to optimally adjust a patient’s treat¬ 
ments in response to the evolution of the patient’s 
clinical and laboratory data. 

2.1 Overview 

Robins career from this point on became a “quest” 
to solve this problem, and thereby provide meth¬ 
ods that would address central epidemiological ques¬ 
tions, for example, is a given long-term exposure 
harmful or a treatment beneficial ? If beneficial, what 
interventions, that is, treatment strategies, are opti¬ 
mal or near optimal? 

In the process, Robins created a “bestiary” of 
causal models and analytic methods. 4 There are the 
basic “phyla” consisting of the g-formula, marginal 
structural models and structural nested models. 
These phyla then contain “species,” for example, 
structural nested failure time models, structural 


3 The author, Ethel Gilbert, is the mother of Peter Gilbert 
who is a contributor to this special issue; see (Richardson 
et al. (2014)). 

4 In the epidemiologic literature, this bestiary is sometimes 
referred to as the collection of “g-methods.” 
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nested distribution models, structural nested (multi¬ 
plicative additive and logistic) mean models and yet 
further “subspecies”: direct-effect structural nested 
models and optimal-regime structural nested mod¬ 
els. 

Each subsequent model in this taxa was developed 
to help answer particular causal questions in specific 
contexts that the “older siblings” were not quite up 
to. Thus, for example, Robins’ creation of structural 
nested and marginal structural models was driven 
by the so-called null paradox, which could lead to 
falsely finding a treatment effect where none existed, 
and was a serious nonrobustness of the estimated g- 
formula, his then current methodology. Similarly, his 
research on higher-order influence function estima¬ 
tors was motivated by a concern that, in the pres¬ 
ence of confounding by continuous, high dimensional 
confounders, even doubly robust methods might fail 
to adequately control for confounding bias. 

This variety also reflects Robins’ belief that the 
best analytic approach varies with the causal ques¬ 
tion to be answered, and, even more importantly, 
that confidence in one’s substantive findings only 
comes when multiple, nearly orthogonal, modeling 
strategies lead to the same conclusion. 

2.2 Causally Interpreted Structured Tree Graphs 

Suppose one wishes to estimate from longitudi¬ 
nal data the causal effect of time-varying treat¬ 
ment or exposure, say cigarette smoking, on a fail¬ 
ure time outcome such as all-cause mortality. In 
this setting, a time-dependent confounder is a time- 
varying covariate (e.g., presence of emphysema) that 
is a predictor of both future exposure and of fail¬ 
ure. In 1982, the standard analytic approach was 
to model the conditional probability (i.e., the haz¬ 
ard) of failure time t as a function of past expo¬ 
sure history using a time-dependent Cox propor¬ 
tional hazards model. Robins formally showed that, 
even when confounding by unmeasured factors and 
model specification are absent, this approach may 
result in estimates of effect that may fail to have a 
causal interpretation, regardless of whether or not 
one also adjusts for the measured time-dependent 
confounders in the analysis. In fact, if previous ex¬ 
posure also predicts the subsequent evolution of the 
time-dependent confounders (e.g., since smoking is 
a cause of emphysema, it predicts this disease) then 
the standard approach can find an artifactual expo¬ 
sure effect even under the sharp null hypothesis of 


no net, direct or indirect effect of exposure on the 
failure time of any subject. 

Prior to Robins (1986), although informal discus¬ 
sions of net, direct and indirect (i.e., mediated) ef¬ 
fects of time varying exposures were to be found 
in the discussion sections of most epidemiologic pa¬ 
pers, no formal mathematical definitions existed. 
To address this, Robins (1986) introduced a new 
counterfactual model, the finest fully randomized 
causally interpreted structured tree graph (FFR¬ 
CISTG) 5 model that extended the point treatment 
counterfactual model of Neyman (1923) and Rubin 
(1974, 1978a) 6 * to longitudinal studies with time- 
varying treatments, direct and indirect effects and 
feedback of one cause on another. Due to his lack 
of formal statistical training, the notation and for¬ 
malisms in Robins (1986) differ from those found in 
the mainstream literature; as a consequence the pa¬ 
per can be a difficult read.' Richardson and Robins 
(2013, Appendix C) present the FFRCISTG model 
using a more familiar notation. 8 

We illustrate the basic ideas using a simplified ex¬ 
ample. Suppose that we obtain data from an obser- 


5 A complete list of acronyms used is given before the Ref¬ 
erences. 

6 See Freedman (2006) and Sekhon (2008) for historical re¬ 
views of the counterfactual point treatment model. 

' Robins published an informal, accessible, summary of his 
main results in the epidemiologic literature (Robins (1987a)). 
However, it was not until 1992 (and many rejections) that his 
work on causal inference with time-varying treatments ap¬ 
peared in a major statistical journal. 

s The perhaps more familiar Non-Parametric Structural 
Equation Model with Independent Errors (NPSEM-IE) con¬ 
sidered by Pearl may be viewed as submodel of Robins’ FFR¬ 
CISTG. 

A Non-Parametric Structural Equation Model (NPSEM) 
assumes that all variables (U) can be intervened on. In con¬ 
trast, the FFRCISTG model does not require one to assume 
this. However, if all variables in V can be intervened on, then 
the FFRCISTG specifies a set of one-step ahead counterfac- 
tuals, Vm(vm- 1 ) which may equivalently be written as struc¬ 
tural equations V m (v m -i) = for functions f m 

and (vector-valued) random errors e m . Thus, leaving aside no- 
tational differences, structural equations and one-step ahead 
counterfactuals are equivalent. All other counterfactuals, as 
well as factual variables, are then obtained by recursive sub¬ 
stitution. 

However, the NPSEM-IE model of Pearl (2000) further 
assumes the errors e m are jointly independent. In contrast, 
though an FFRCISTG model is also an NPSEM, the errors 
(associated with incompatible counterfactual worlds) may be 
dependent—though any such dependence could not be de¬ 
tected in a RCT. Hence, Pearl’s model is a strict submodel of 
an FFRCISTG model. 
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vational or randomized study in which n patients 
are treated at two times. Let A\ and A 2 denote the 
treatments. Let L be a measurement taken just prior 
to the second treatment and let Y be a final out¬ 
come, higher values of which are desirable. To sim¬ 
plify matters, for now we will suppose that all of the 
treatments and responses are binary. As a concrete 
example, consider a study of HIV infected subjects 
with (A\,L,A 2 ,Y), respectively, being binary indi¬ 
cators of anti-retroviral treatment at time 1, high 
CD4 count just before time 2, anti-retroviral ther¬ 
apy at time 2, and survival at time 3 (where for 
simplicity we assume no deaths prior to assignment 
of A 2 ). There are 2 4 = 16 possible observed data se¬ 
quences for (A\,L, A 2 , Y); these may be depicted as 
an event tree as in Figure l. 9 Robins (1986) referred 
to such event trees as “structured tree graphs.” 

We wish to assess the effect of the two treatments 
( 01 , 02 ) on Y. In more detail, for a given subject 
we suppose the existence of four potential outcomes 



Ai L A 2 Y 


Fig. 1. Causal tree graph depicting a simple scenario with 
treatments at two times A\, A 2 , a response L measured prior 
to A 2 , and a final response Y. Blue circles indicate evolution 
of the process determined by Nature; red dots indicate poten¬ 
tial treatment choices. 


9 In practice, there will almost always exist baseline covari¬ 
ates measured prior to A\. In that case, the analysis in the 
text is to be understood as being with a given joint stratum of 
a set of baseline covariates sufficient to adjust for confounding 
due to baseline factors. 


y(ai,a 2 ) for ai,a 2 € {0,1} which are the outcome 
a patient would have if (possibly counter-to-fact) 
they were to receive the treatments a\ and 02 - Then 
E\Y(a\,a 2 )\ is the mean outcome (e.g., the survival 
probability) if everyone in the population were to re¬ 
ceive the specified level of the two treatments. The 
particular instance of this regime under which ev¬ 
eryone is treated at both times, so ai = a 2 = 1, is 
depicted in Figure 4(a). We are interested in esti¬ 
mation of these four means since the regime ( 01 , 02 ) 
that maximizes E[Y (oi, 02 )] is the regime a new pa¬ 
tient exchangeable with the n study subjects should 
follow. 

There are two extreme scenarios: If in an obser¬ 
vational study, the treatments are assigned, for ex¬ 
ample, by doctors, based on additional unmeasured 
predictors U of Y then E[Y ( 01 , 02 )] is not identified 
since those receiving ( 01 , 02 ) within the study are 
not representative of the population as a whole. 

At the other extreme, if the data comes from a 
completely randomized clinical trial (RCT) in which 
treatment is assigned independently at each time 
by the flip of coin, then it is simple to see that the 
counterfactual Y (ai, 02 ) is independent of the treat¬ 
ments (^ 1 ,^ 2 ) and that the average potential out¬ 
comes are identified since those receiving ( 01 , 02 ) in 
the study are a simple random sample of the whole 
population. Thus, 

(1) y(oi,o 2 ) _1L {A 1 ,A 2 }, 

(2) E[Y (ai,o 2 )] = E[Y | Ai = ai,A 2 = o 2 ], 

where the right-hand side of (2) is a function of the 
observed data distribution. In a completely random¬ 
ized experiment, association is causation: the associ- 
ational quantity on the right-hand side of (2) equals 
the causal quantity on the left-hand side. 

Robins, however, considered an intermediate trial 
design in which both treatments are randomized, 
but the probability of receiving A 2 is dependent 
on both the treatment received initially (Ai) and 
the observed response (L); a scenario now termed 
a sequential randomized trial. Robins viewed his 
analysis as also applicable to observational data as 
follows. In an observational study, the role of an 
epidemiologist is to use subject matter knowledge 
to try to collect in L sufficient data to eliminate 
confounding by unmeasured factors, and thus to 
have the study mimic a sequential RCT. If success¬ 
ful, the only difference between an actual sequen¬ 
tial randomized trial and an observational study is 
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that in the former the randomization probabilities 
Pr(^42 = 1 | L,Ai) are known by design while in the 
latter they must be estimated from the data. 10 

Robins viewed the sequential randomized trial as 
a collection of five trials in total: the original trial 
at t = 1, plus a set of four randomized trials at t = 2 
nested within the original trial. 11 Let the counter- 
factual L(a\) be the outcome L when A\ is set to 
asi. Since the counterfactuals Y(ai,a 2 ) and L(ai) do 
not depend on the actual treatment received, they 
can be viewed, like a subject’s genetic make-up, as a 
fixed (possibly unobserved) characteristic of a sub¬ 
ject and therefore independent of the randomly as¬ 
signed treatment conditional on pre-randomization 
covariates. That is, for each (ai,a 2 ) and l: 

(3) {Y(a 1 ,a 2 ),L(ai)} _LL A 1} 

(4) Y(ai,a 2 ) _LL A 2 | A\ = ai, L = l. 

These independences suffice to identify the joint 
density / y( 

cli , 012 ), L(ai)(y,l) of (Y(ai,a 2 ),L(ai)) from 
the distribution of the factual variables by the 
“g-computation algorithm formula” (or simply g- 
formula) density 

( 5 ) fa, ,a 2 (.V > 0 = f{y I ai,l,a 2 )f(l | ai) 

provided the conditional probabilities on the right- 
hand side are well-defined (Robins, 1986, page 
1423). Note that /*, 02 (?L0 is obtained from the 
joint density of the factuals by removing the treat¬ 
ment terms f(a 2 \ ai,l,a 2 )f(ai). This is in-line with 
the intuition that A\ and A 2 cease to be ran¬ 
dom since, under the regime, they are set by in¬ 
tervention to constants a\ and a 2 . The g-formula 
was later referred to as the “manipulated density” 
by Spirtes, Glymour and Schemes (1993) and the 
“truncated factorization” by Pearl (2000). 

Robins (1987b) showed that under the weaker 
condition that replaces (3) and (4) with 

Y(ai,a 2 ) -1L Ai and 

( 6 ) 

Y(ai,a 2 ) AL A 2 \ Ai = a\, L = l , 

the marginal density of Y(ai,a 2 ) is still identified 
by 

(7) fa u a 2 (y) =^2f(y I a 1 ,l,a 2 )f(l I ai), 

i 


10 Of course, one can never be certain that the epidemiolo¬ 
gists were successful which is the reason RCTs are generally 
considered the gold standard for establishing causal effects. 

n That is, the trials starting at t = 2 are on study popula¬ 
tions defined by specific (Ai, L)-histories. 


the marginal under f* t 02 (;t/,/). 12 Robins called (6) 
randomization w.r.t. Y. 13 Furthermore, he provided 
substantive examples of observational studies in 
which only the weaker assumption would be ex¬ 
pected to hold. It is much easier to describe these 
studies using representations of causal systems using 
Directed Acyclic Graphs and Single World Interven¬ 
tion Graphs, neither of which existed when (Robins 
(1987b)) was written. 

2.3 Causal DAGs and Single World Intervention 
Graphs (SWIGs) 

Causal DAGs were first introduced in the seminal 
work of Spirtes, Glymour and Scheines (1993); the 
theory was subsequently developed and extended by 
Pearl (1995a, 2000) among others. 

A causal DAG with random variables V\,..., Vm 
as nodes is a graph in which (1) the lack of an arrow 
from node Vj to V m can be interpreted as the ab¬ 
sence of a direct causal effect of Vj on V m (relative 
to the other variables on the graph), (2) all com¬ 
mon causes, even if unmeasured, of any pair of vari¬ 
ables on the graph are themselves on the graph, and 
(3) the Causal Markov Assumption (CMA) holds. 
The CMA links the causal structure represented by 
the Directed Acyclic Graph (DAG) to the statistical 
data obtained in a study. It states that the distribu¬ 
tion of the factual variables factor according to the 
DAG. A distribution factors according to the DAG 
if nondescendants of a given variable Vj are indepen¬ 
dent of Vj conditional on pa^, the parents of Vj . The 
CMA is mathematically equivalent to the statement 
that the density f(vi,..., vm) of the variables on the 
causal DAG Q satisfies the Markov factorization 

M 

( 8 ) f(vi,...,v M ) = Y[f(vj\p&j). 

3 = 1 

A graphical criterion, called d-separation (Pearl 
(1988)), characterizes all the marginal and condi¬ 
tional independences that hold in every distribution 
obeying the Markov factorization (8). 

Causal DAGs may also be used to represent the 
joint distribution of the observed data under the 
counterfactual FFRCISTG model of Robins (1986). 


12 The g-formula density for Y is a generalization of stan¬ 
dardization of effect measures to time varying treatments. See 
Keiding and Clayton (2014) for a historical review of stan¬ 
dardization. 

13 Note that the distribution of L(oi) is no longer identified 
under this weaker assumption. 




T. S. RICHARDSON AND A. ROTNITZKY 




Fig. 2. (a) A causal DAG Q describing a sequentially randomized trial; (b) the SWIG G(a\, 02 ) resulting from intervening 

on Ai and A 2 . 


This follows because an FFRCISTG model over the 
variables {V±,... ,Vm} induces a distribution that 
factors as (8). Figure 2(a) shows a causal Directed 
Acyclic Graph (DAG) corresponding to the sequen¬ 
tially randomized experiment described above: ver¬ 
tex H represents an unmeasured common cause 
(e.g., immune function) of CD4 count L and survival 
Y. Randomization of treatment implies A\ has no 
parents and A 2 has only the observed variables A\ 
and L as parents. 

Single-World Intervention Graphs (SWIGs), in¬ 
troduced in (Richardson and Robins (2013)), pro¬ 
vide a simple way to derive the counterfactual in¬ 
dependence relations implied by an FFRCISTG 
model. SWIGs were designed to unify the graphi¬ 
cal and potential outcome approaches to causality. 
The nodes on a SWIG are the counterfactual ran¬ 
dom variables associated with a specific hypothet¬ 
ical intervention on the treatment variables. The 
SWIG in Figure 2(b) is derived from the causal 
DAG in Figure 2(a) corresponding to a sequentially 
randomized experiment. The SWIG represents the 
counterfactual world in which A\ and A 2 have been 
set to ( 01 , 02 ), respectively. Richardson and Robins 
(2013) show that under the (naturally associated) 
FFRCISTG model the distribution of the counter- 
factual variables on the SWIG factors according to 
the graph. Applying Pearl’s d-separation criterion 
to the SWIG we obtain the independences (3) and 

(4)- 14 

Robins (1987b) in one of the aforementioned sub¬ 
stantive examples described an observational study 


14 More precisely, we obtain the SWIG independence 
Y(ai, 02 ) 1L ^ 2 ( 01 ) | Ai,L(ai), that implies (4) by the con¬ 
sistency assumption after instantiating Ai at a\. Note when 
checking d-separation on a SWIG all paths containing red 
“fixed” nonrandom vertices, such as ai, are treated as always 
being blocked (regardless of the conditioning set). 


of the effect of formaldehyde exposure on the mor¬ 
tality of rubber workers which can represented by 
the causal graph in Figure 3(a). (This graph cannot 
represent a sequential RCT because the treatment 
variable A\ and the response L have an unmeasured 
common cause.) Follow-up begins at time of hire; 
time 1 on the graph. The vertices H \, A\, H 2 , L 2 , 
A 2 , Y are indicators of sensitivity to eye irritants, 
formaldehyde exposure at time 1, lung cancer, cur¬ 
rent employment, formaldehyde exposure at time 2 
and survival. Data on eye-sensitivity and lung can¬ 
cer were not collected. Formaldehyde is a known eye- 
irritant. The presence of an arrow from H\ to A\ but 
not from H\ to A 2 reflects the fact that subjects 
who believe their eyes to be sensitive to formalde¬ 
hyde are given the discretion to choose a job with¬ 
out formaldehyde exposure at time of hire but not 
later. The arrow from H\ to L reflects the fact that 
eye sensitivity causes some subjects to leave employ¬ 
ment. The arrows from H 2 to L 2 and Y reflects the 
fact that lung cancer causes both death and loss of 
employment. The fact that H\ and H 2 are indepen¬ 
dent reflects the fact that eye sensitivity is unrelated 
to the risk of lung cancer. 

From the SWIG in Figure 3(b), we can see that 
(6) holds so we have randomization with respect to 
Y but L(ai) is not independent of A\. It follows 
that the g-formula fa 1CL2 {y) equals the density of 
y( 01 , 02 ) even though (i) the distribution of L(a\) 
is not identified and (ii) neither of the individual 
terms f(l \ a\) and f(y \ 01 ,^ 02 ) occurring in the 
g-formula has a causal interpretation. 15 


15 Above we have assumed the variables Ai, L, A 2 , Y oc¬ 
curring in the g-formula are temporally ordered. Interestingly, 
Robins (1986, Section 11) showed identification by the g- 
formula can require a nontemporal ordering. In his analysis 
of the Healthy Worker Survivor Effect, data were available on 
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Fig. 3. Formaldehyde study: Hi, indicator of sensitivity to eye irritants; A\, formaldehyde exposure at time 1; H 2 , lung 
cancer; L, current employment; A 2 , formaldehyde exposure at time 2; Y, survival. Hi and H 2 are unmeasured, (a) A causal 
DAG Q in which initial treatment is confounded, while the second treatment is sequentially randomized; (b) the SWIG G(ai,a 2 ). 
L is known to have no direct effect on Y except indirectly via the effect on A 2 ; Hi influences Ai but not A 2 . See text for 
further explanation. 


Subsequently, Tian and Pearl (2002a) developed a 
graphical algorithm for nonparametric identification 
that is “complete” in the sense that if the algorithm 
fails to derive an identifying formula, then the causal 
quantity is not identified (Shpitser and Pearl, 2006; 
Huang and Valtorta, 2006). This algorithm strictly 
extends the set of causal identification results ob¬ 
tained by Robins for static regimes. 

2.4 Dynamic Regimes 

The “g” in “g-formula” and elsewhere in Robins’ 
work refers to generalized treatment regimes g. The 
set G of all such regimes includes dynamic regimes 
in which a subject’s treatment at time 2 depends 
on the response L to the treatment at time 1. 
An example of a dynamic regime is the regime in 
which all subjects receive anti-retroviral treatment 
at time 1, but continue to receive treatment at time 
2 only if their CD4 count at time 2 is low, indi¬ 
cating that they have not yet responded to anti¬ 
retroviral treatment. In our study with no baseline 
covariates and A± and A2 binary, a dynamic regime 
g can be written as g = (01,52(0) where the function 
52(0 specifies the treatment to be given at time 2. 
The dynamic regime above has (ai = 1,52(0 = 1 — 0 


temporally ordered variables (Ai, Li, A 2 , L 2 , Y) where the Lt 
are indicators of survival until time year t, At is the indicator 
of exposure to a lung carcinogen and, there exists substantive 
background knowledge that carcinogen exposure at t cannot 
cause death within a year. Under these assumptions, Robins 
proved that equation ( 6 ) was false if one respected temporal 
order and chose Lto be Ii, but was true if one chose L = L 2 ■ 
Thus, E[Y(a 1 , 02 )] was identified by the g-formula fa 1 ,a 2 (v ) 
only for L = L, 2 - See (Richardson and Robins, 2013, page 54) 
for further details. 


and is highlighted in Figure 4. If L is binary, then 
G consists of 8 regimes comprised of the 4 ear¬ 
lier static regimes ( 01 , 02 ) and 4 dynamic regimes. 
The g-formula density associated with a regime 
g = {ai,g 2 (l)) is 

= f(l I ai)f(y I Ai = a u L = l,A 2 =g 2 (l)). 

Letting Y ( 5 ) be a subject’s counterfactual outcome 
under regime 5 , Robins (1987b) proves that if both 
of the following hold: 

Y(g)ALA 1 , 

(9) 

Y(g) _1L A 2 I Ai = oi, L = l 

then /y( 9 )(y) is identified by the g-formula density 
for Y: 

r 9 (y) = ^r g (y,i) 

l 

= J2f(y\ Al = a Y L = l i A 2=g2{l)) 

l 

■ f(l I ai)- 

Robins (1987b) refers to (9) as the assumption that 
regime 5 is randomized with respect to Y. Given a 
causal DAG, Dynamic SWIGs (dSWIGS) can be 
used to check whether (9) holds. Tian (2008) gives 
a complete graphical algorithm for identification of 
the effect of dynamic regimes based on DAGs. 

Independences (3) and (4) imply that (9) is true 
for all 5 G G. For a drug treatment, for which, say, 
higher outcome values are better, the optimal regime 
5 o P t maximizing E[Y ( 5 )] over 5 G G is almost always 
a dynamic regime, as treatment must be discontin¬ 
ued when toxicity, a component of L , develops. 
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Fig. 4. Tree graphs depicting specific treatment regimes: (a) 01 = a 2 = 1; (b) the dynamic regime a\ = 1, GE 2 = (1 — l). The 
red paths indicate all possible observed data sequences under these regimes. 


Robins (1986, 1989, page 1423) used the g-nota- 
tion f(y | g) as a shorthand for fY( g )(y) i n order 
to emphasize that this was the density of Y had 
intervention g been applied to the population. In the 
special case of static regimes ( 01 , 02 ), he wrote f(y \ 
g = (ai,o 2 )). 16 

2.5 Statistical Limitations of the Estimated 
g-Formulae 

Consider a sequentially randomized experiment. 
In this context, randomization probabilities /( 01 ) 
and /(a 2 | ai,Z) are known by design; however, the 
densities f(y | ai,a 2 ,Z) and f(l \ a\) are not known 
and, therefore, they must be replaced by estimates 
f(y | oi,a 2 ,Z 2 ) and /(/ | ai) in the g-formula. If the 
sample size is moderate and l is high dimensional, 
these estimates must come from fitting dimension- 
reducing models. Model misspecification will then 
lead to biased estimators of the mean of T(ai,a 2 ). 
Robins (1986) and Robins and Wasserman (1997) 
described a serious nonrobustness of the g-formula: 
the so-called “null paradox”: In biomedical trials, 
it is frequently of interest to consider the possi¬ 
bility that the sharp causal null hypothesis of no 
effect of either A\ or A 2 on Y holds. Under this 
null, the causal DAG generating the data is as in 
Figure 2 except without the arrows from Ai, A 2 


16 Pearl (1995a) introduced an identical notation except 
that he substituted the word “do” for “(? =,” thus writing 
f{y | do(ar,a 2 )). 


and L into Y Then, under this null, although 
fai ,a 2 (■ y) = J2if(y I at,Z,a 2 )/(Z | ai) does not de¬ 
pend on (ai,a 2 ), nonetheless both f(y | ai,Z,a 2 ) and 
/(Z | ai) will, in general, depend on a\ (as may be 
seen via d-connection). 18 In general, if L has discrete 
components, it is not possible for standard nonsat- 
urated parametric models (e.g., logistic regression 
models) for both f(y | ai,a 2 ,Z 2 ) and /(Z 2 | ai) to 
be correctly specified, and thus depend on ai and 
yet for a2 (y) not to depend on ai. 19 As a conse¬ 
quence, inference based on the estimated g-formula 
must result in the sharp null hypothesis being falsely 
rejected with probability going to 1, as the trial size 
increases, even when it is true. 

2.6 Structural Nested Models 20 

To overcome the null paradox, Robins (1989) and 
Robins et al. (1992) introduced the semiparametric 
structural nested distribution model (SNDMs) for 
continuous outcomes Y and structural nested failure 
time models (SNFTMs) for time to event outcomes. 
See Robins (1997a, 1997b) for additional details. 


17 If the L—>Y edge is present, then A\ still has an effect 
on Y. 

18 The dependence of f{y \ ai, l, a 2 ) on ai does not represent 
causation but rather selection bias due to conditioning on the 
common effect L of Hi and Ai. 

19 But see Cox and Wermuth (1999) for another approach. 

20 These models are discussed by Vansteelandt and Joffe 
(2014) in this issue. 
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Robins (1986, Section 6 ) defined the g-null hy¬ 
pothesis as 


( 10 ) 


Hq : the distribution of Y ( g ) 
is the same for all g € G. 


This hypothesis is implied by the sharp null hypoth¬ 
esis of no effect of A\ or A 2 on any subject’s Y. If 
(9) holds for all g € G, then the g-null hypothesis is 
equivalent to any one of the following assertions: 

(i) fp(y) equals the factual density f(y) for all g G 

G; 

(ii) Y _IL A\ and Y _IL A 2 | L, A \; 

(iii) /* (y) does not depend on ( 01 , 02 ) and Y _IL 

A.2 I L,Ai] 

see Robins (1986, Section 6 ). In addition, any one 
of these assertions exhausts all restrictions on the 
observed data distribution implied by the sharp null 
hypothesis. 

Robins’ goal was to construct a causal model in¬ 
dexed by a parameter ip* such that in a sequen¬ 
tially randomized trial (i) ip* = 0 if and only if the 
g-null hypothesis ( 10 ) was true and (ii) if known, one 
could use the randomization probabilities to both 
construct an unbiased estimating function for ip* 
and to construct tests of ip* = 0 that were guar¬ 
anteed (asymptotically) to reject under the null at 
the nominal level. The SNDMs and SNFTMs ac¬ 
complish this goal for continuous and failure time 
outcomes Y. Robins (1989) and Robins (1994) also 
constructed additive and multiplicative structural 
nested mean models (SNMMs) which satisfied the 
above properties except with the g-null hypothesis 
replaced by the g-null mean hypothesis: 

(11) H 0 :E[Y(g)] = E[Y] for all g € G. 

As an example, we consider an additive structural 
nested mean model. Define 


'y(a 1 ,l,a 2 ) 

= E[Y( 0 , 1 , 02 ) - Y(a\,0) \ L = l,Ai = a\, 

A -2 = 02 ] 

and 


y(ai) = E[Y (ai, 0) - T(0,0) | Ai = ai]. 

Note 7 ( 01 , l, a 2 ) is the effect of the last blip of treat¬ 
ment a 2 at time 2 among subjects with observed 
history (a\,l,a 2 ), while 7 ( 01 ) is the effect of the 
last blip of treatment a\ at time 1 among subjects 


with history a\. An additive SNMM specifies para¬ 
metric models 7 ( 01 , l, o 2 ; ip 2 ) and 7(01; Vh) f° r these 
blip functions with 7 ( 01 ; 0) = 7 ( 01 , l, a 2 ; 0) = 0. Un¬ 
der the independence assumptions (9), H. 2 (ip 2 ) x 
d(L, A 1 ){A 2 -E[A 2 I L,A 1 ]} and Hi(V>){^i -Efa]} 
are unbiased estimating functions for the true 
ip*, where H 2 (ip 2 ) = Y - j(A lt L, A 2 ; ip 2 ), H^ip) = 
H 2 (ip 2 ) ~ l(Ai‘,ipi), and d(L,A\ ) is a user-supplied 
function of the same dimension as ip 2 ■ Under the g- 
null mean hypothesis (11), the SNMM is guaranteed 
to be correctly specified with ip* = 0. Thus, these 
estimating functions when evaluated at ip* =0, can 
be used in the construction of an asymptotically a- 
level test of the g-null mean hypothesis when f(a 1 ) 
and f(a 2 \ a\,l) are known (or are consistently es¬ 
timated ). 21 When L is a high-dimensional vector, 
the parametric blip models may well be misspec- 
ified when g-null mean hypothesis is false. How¬ 
ever, because the functions 7(01 ,l,a 2 ) and 7 ( 01 ) are 
nonparametrically identified under assumptions (9), 
one can construct consistent tests of the correct¬ 
ness of the blip models ')(ai,l,a 2 ',ip 2 ) and 7 ( 01 ; ipi). 
Furthermore, one can also estimate the blip func¬ 
tions using cross-validation (Robins (2004)) and/or 
flexible machine learning methods in lieu of a pre¬ 
specified parametric model (van der Laan and Rose 
(2011)). A recent modification of a multiplicative 
SNMM, the structural nested cumulative failure 
time model, designed for censored time to event out¬ 
comes has computational advantages compared to a 
SNFTM, because, in contrast to a SNFTM, param¬ 
eters are estimated using an unbiased estimating 
function that is differentiable in the model parame¬ 
ters; see Picciotto et al. (2012). 

Robins (2004) also introduced optimal-regime 
SNNMs drawing on the seminal work of Murphy 
(2003) on semiparametric methods for the esti¬ 
mation of optimal treatment strategies. Optimal- 
regime SNNM estimation, called A-learning in com¬ 
puter science, can be viewed as a semiparametric 
implementation of dynamic programming (Bellman, 
1957). 22 Optimal-regime SNMMs differ from stan- 


21 In the literature, semiparametric estimation of the pa¬ 
rameters of a SNM based on such estimating functions is re¬ 
ferred to as “g-estimation.” 

22 Interestingly, Robins (1989, page 127 and App. 1), un¬ 
aware of Bellman’s work, reinvented the method of dynamic 
programming but remarked that, due to the difficulty of the 
estimation problem, it would only be of theoretical interest 
for finding the optimal dynamic regimes from longitudinal 
epidemiological data. 
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dard SNMMs only in that 7 ( 01 ) is redefined to be 

7 ( 01 ) = E[Y(ai,g 2 , op t(ai,L(ai))) 

~ Y(0,g 2 , op t(0,L(0))) | Ai = ai], 

where 52 ,opt(ai ,0 = argmax a2 7 ( 01 , l, a 2 ) is the op¬ 
timal treatment at time 2 given past history (ai,Z). 
The overall optimal treatment strategy g op t is then 
(ai,opt,32,opt(ai,Z)) where ai i0pt = argmax ai 7 ( 01 ). 
More on the estimation of optimal treatment regimes 
can be found in Schulte et al. (2014) in this volume. 

2.7 Instrumental Variables and Bounds for the 
Average Treatment Effect 

Robins (1989, 1993) also noted that structural 
nested models can be used to estimate treatment 
effects when assumptions (9) do not hold but data 
are available on a time dependent instrumental vari¬ 
able. As an example, patients sometimes fail to fill 
their prescriptions and thus do not comply with 
their prescribed treatment. In that case, we can take 
Aj = (Aj,Aj) for each time j, where A\ • denotes 
the treatment prescribed and A d denotes the dose 
of treatment actually received at time j. Robins 
defined A p - to be an instrumental variable if (9) 
still holds after replacing Aj by A p - and for all sub¬ 
jects Y (ai, 02) depends on aj = (a?, a d j) only through 
the actual dose a d . Robins noted that unlike the 
case of full compliance (i.e., A p = A d with probabil¬ 
ity 1 ) discussed earlier, the treatment effect func¬ 
tions 7 are not nonparametrically identified. Conse¬ 
quently, identification can only be achieved by cor¬ 
rectly specifying (sufficiently restrictive) parametric 
models for 7 . 

If we are unwilling to rely on such parametric as¬ 
sumptions, then the observed data distribution only 
implies bounds for the 7 ’s. In particular, in the set¬ 
ting of a point treatment randomized trial with non- 
compliance and the instrument A p being the as¬ 
signed treatment, Robins (1989) obtained bounds on 
the average causal effect E[Y(a c i = 1 ) — Y(a,d = 0)] 
of the received treatment Ad- To the best of our 
knowledge, this paper was the first to derive bounds 
for nonidentified causal effects defined through po¬ 
tential outcomes. 2 '"' The study of such bounds has 
become an active area of research. Other early pa¬ 
pers include Manski (1990) and Balke and Pearl 


23 See also Robins and Greenland (1989a, 1989b). 


(1994). 24 See Richardson et al. (2014) in this vol¬ 
ume for a survey of recent research on bounds. 

2.8 Limitations of Structural Nested Models 

Robins (2000) noted that there exist causal ques¬ 
tions for which SNMs are not altogether satisfac¬ 
tory. As an example, for Y binary, Robins (2000) 
proposed a structural nested logistic model in order 
to ensure estimates of the counterfactual mean of Y 
were between zero and one. However, he noted that 
knowledge of the randomization probabilities did 
not allow one to construct unbiased estimating func¬ 
tion for its parameter ip*. More importantly, SNMs 
do not directly model the final object of public 
health interest—the distribution or mean of the out¬ 
come Y as function of the regimes g —as these dis¬ 
tributions are generally functions not only of the pa¬ 
rameters of the SNM but also of the conditional law 
of the time dependent covariates L given the past 
history. In addition, SNMs constitute a rather large 
conceptual leap from standard associational regres¬ 
sion models familiar to most statisticians. Robins 
(1998, 2000) introduced a new class of causal mod¬ 
els, marginal structural models, that overcame these 
particular difficulties. Robins also pointed out that 
MSMs have their own shortcomings, which we dis¬ 
cuss below. Robins (2000) concluded that the best 
causal model to use will vary with the causal ques¬ 
tion of interest. 

2.9 Dependent Censoring and Inverse 
Probability Weighting 

Marginal Structural Models grew out of Robins’ 
work on censoring and inverse probability of censor¬ 
ing weighted (IPCW) estimators. Robins work on 
dependent censoring was motivated by the famil¬ 
iar clinical observation that patients who did not 
return to the clinic and were thus censored differed 
from other patients on important risk factors, for ex¬ 
ample measures of cardio-pulmonary reserve. In the 
1970s and 1980s, the analysis of right censored data 
was a major area of statistical research, driven by 
the introduction of the proportional hazards model 
(Cox (1972); Kalbfleisch and Prentice (1980)) and 
by martingale methods for their analysis (Aalen 
(1978); Andersen et al. (1993); Fleming and Har¬ 
rington (1991)). This research, however, was focused 


24 Balke and Pearl (1994) showed that Robins’ bounds were 
not sharp in the presence of “defiers” (i.e., subjects who would 
never take the treatment assigned) and derived sharp bounds 
in that case. 
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on independent censoring. An important insight in 
Robins (1986) was the recognition that by refram¬ 
ing the problem of censoring as a causal inference 
problem as we will now explain, it was possible to 
adjust for dependent censoring with the g-formula. 

Rubin (1978a) had pointed out previously that 
counter factual causal inference could be viewed as 
a missing data problem. Robins (1986, page 1491) 
recognized that the converse was indeed also true: a 
missing data problem could be viewed as a problem 
in counterfactual causal inference . 25 Robins concep¬ 
tualized right censoring as just another time depen¬ 
dent “treatment” At and one’s inferential goal as 
the estimation of the outcome Y under the static 
regime g “never censored.” Inference based on the 
g-formula was then licensed provided that censoring 
was explainable in the sense that ( 6 ) holds. This ap¬ 
proach to dependent censoring subsumed indepen¬ 
dent censoring as the latter is a special case of the 
former. 

Robins, however, recognized once again that in¬ 
ference based on the estimated g-formula could be 
nonrobust. To overcome this difficulty, (Robins and 
Rotnitzky, 1992) introduced IPCW tests and esti¬ 
mators whose properties are easiest to explain in 
the context of a two-armed RCT of a single treat¬ 
ment (Ai). The standard Intention-to-Treat (ITT) 
analysis for comparing the survival distributions in 
the two arms is a log-rank test. However, data are 
often collected on covariates, both pre- and post- 
randomization, that are predictive of the outcome 
as well as (possibly) of censoring. An ITT analy¬ 
sis that tries to adjust for dependent-censoring by 
IPCW uses estimates of the arm-specific hazards of 
censoring as functions of past covariate history. The 
proposed IPCW tests have the following two advan¬ 
tages compared to the log rank test. First, if cen¬ 
soring is dependent but explainable by the covari¬ 
ates, the log-rank test is not asymptotically valid. In 
contrast, IPCW tests asymptotically reject at their 
nominal level provided the arm-specific hazard es¬ 
timators are consistent. Second, when censoring is 
independent, although both the IPCW tests and the 
log-rank test asymptotically reject at their nominal 
level, the IPCW tests, by making use of covariates, 
can be more powerful than the log-rank test even 


25 A viewpoint recently explored by Mohan, Pearl and Tian 
(2013). 


against proportional-hazards alternatives. Even un¬ 
der independent censoring tests based on the esti¬ 
mated g-formula are not guaranteed to be asymp¬ 
totically a-level, and hence are not robust. 

To illustrate, we consider here an RCT with 
A\ being the randomization indicator, L a post¬ 
randomization covariate, A 2 the indicator of cen¬ 
soring and Y the indicator of survival. For simplic¬ 
ity, we assume that any censoring occurs at time 2 
and that there are no failures prior to time 2. The 
IPCW estimator £ of the ITT effect /3* = E[Y \ A = 
1] — E\Y | A = 0] is defined as the solution to 

(12) P n [I(A 2 = 0 )U(P)/Pt(A 2 = 0 I L, A!)] = 0, 

where U((3) = (Y — j3Ai)(A\ — 1/2), throughout P„. 
denotes the empirical mean operator and Pr(A .2 = 
0 | L,A\) is an estimator of the arm-specific condi¬ 
tional probability of being uncensored. When first 
introduced in 1992, IPCW estimators, even when 
taking the form of simple Horvitz-Thompson esti¬ 
mators, were met with both surprise and suspicion 
as they violated the then widely held belief that one 
should never adjust for a post-randomization vari¬ 
able affected by treatment in a RCT. 

2.10 Marginal Structural Models 

Robins (1993, Remark Al.3, pages 257-258) noted 
that, for any treatment regime g, if randomization 
w.r.t. Y, that is, (9), holds, Pr{T(g) > y} can be 
estimated by IPCW if one defines a person’s cen¬ 
soring time as the first time he/she fails to take 
the treatment specified by the regime. In this set¬ 
ting, he referred to IPCW as inverse probability of 
treatment weighted (IPTW). In actual longitudinal 
data in which either (i) treatment A & is measured at 
many times k or (ii) the A *. are discrete with many 
levels or continuous, one often finds that few study 
subjects follow any particular regime. In response, 
Robins (1998, 2000) introduced MSMs. These mod¬ 
els address the aforementioned difficulty by borrow¬ 
ing information across regimes. Additionally, MSMs 
represent another response to the g-null paradox 
complementary to Structural Nested Models. 

To illustrate, suppose that in our example of Sec¬ 
tion 2, A\ and A 2 now have many levels. An in¬ 
stance of an MSM for the counterfactual means 
E[Y(ai,a 2 )\ is a model that specifies that 

$- 1 {E\Y(a 1 ,a 2 )]} = ft+'y(a 1 ,ar,ft), 
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where <h ~ 1 is a given link function such as the logit, 
log, or identity link and 7(01 ,a 2 ;Pi) is a known func¬ 
tion satisfying 7 ( 01 , 02 ; 0) = 0. In this model, /3i = 0 
encodes the static-regime mean null hypothesis that 

(13) Hq : E[Y(a\, 02 )] is the same for all ( 01 , 02 ). 

Robins (1998) proposed IPTW estimators (/3o,/?i) 
of (/?q , /3J). When the treatment probabilities are 
known, these estimators are defined as the solution 
to 

P n [Wv(A 1 ,A 2 )(Y - T{/3 0 + 7 (^ 1 , ^ 2 ; ft)})] 

(14) 

= 0 

for a user supplied vector function v(Ai,A 2 ) of the 
dimension of (/5 q , /^i ) where 

W = l/{f(A 1 )f(A 2 \A 1 ,L)}. 

Informally, the product f(A\)f(A 2 \ A\,L) is the 
“probability that a subject had the treatment his¬ 
tory he did indeed have .” 26 When the treatment 
probabilities are unknown, they are replaced by es¬ 
timators. 

Intuitively, the reason why the estimating func¬ 
tion of (14) has mean zero at (/3q,/3*) is as fol¬ 
lows: Suppose the data had been generated from a 
sequentially randomized trial represented by DAG 
in Figure 2. We may create a pseudo-population 
by making l/{f(Ai)f(A 2 \ A\,L)} copies of each 
study subject. It can be shown that in the resulting 
pseudo-population A 2 _IL {L, Ai}, and thus is repre¬ 
sented by the DAG in Figure 2 , except with both 
arrows into A 2 removed. In the pseudo-population, 
treatment is completely randomized (i.e., there is 
no confounding by either measured or unmeasured 
variables), and hence causation is association. Fur¬ 
ther, the mean of R(ai,a 2 ) takes the same value in 
the pseudo-population as in the actual population. 
Thus if, for example, j(ai, a 2 ; /3i) = /3i,iai + /3i t2 a 2 
and <h _1 is the identity link, we can estimate (/5 q , ) 

by OLS in the pseudo-population. However, OLS 
in the pseudo-population is precisely weighted least 


26 IPTW estimators and IPCW estimators are essentially 
equivalent. For instance, in the censoring example of Sec¬ 
tion 2.9, on the event A 2 = 0 of being uncensored, the 
IPCW denominator pr(A 2 = 0 | L, A\) equals f{A 2 | A\,L), 
the IPTW denominator. 


squares in the actual study population with weights 
l/{f(A 1 )f(A 2 \A 1 ,L)}* 

Robins (2000, Section 4.3) also noted that the 
weights W can be replaced by the so-called stabi¬ 
lized weights SW = {f(A 1 )f(A 2 \ Ai)}/{f(Ai)f(A 2 
Ai,L)}, and described settings where, for efficiency 
reasons, using SW is preferable to using W. 

MSMs are not restricted to models for the depen¬ 
dence of the mean of Y(ai,a 2 ) on ( 01 , 02 ). Indeed, 
one can consider MSMs for the dependence of any 
functional of the law of Y(ai,a 2 ) on ( 01 , 02 ), such as 
a quantile or the hazard function if Y is a time-to- 
event variable. If the study is fully randomized, that 
is, (1) holds, then an MSM model for a given func¬ 
tional of the law of Y(a\ } a 2 ) is tantamount to an 
associational model for the same functional of the 
law of Y conditional on A\ = ai and A 2 = a 2 . Thus, 
under (1), the MSM model can be estimated using 
standard methods for estimating the corresponding 
associational model. If the study is only sequentially 
randomized, that is, (6) holds but (1) does not, then 
the model can still be estimated by the same stan¬ 
dard methods but weighting each subject by W or 
SW. 

Robins (2000) discussed disadvantages of MSMs 
compared to SNMs. Here, we summarize some of 
the main drawbacks. Suppose (9) holds for all g G G. 
If the g-null hypothesis (10) is false but the static 
regime null hypothesis that the law of ^( 01 , 02 ) is 
the same for all ( 01 , 02 ) is true, then by (iii) of 
Section 2.6, f(y \ A\ = a\, A 2 = a 2 , L = l) will de¬ 
pend on a 2 for some stratum (a±,l) thus imply- 


27 More formally, recall that under (6), E[Y(ai,a 2 )} = 
^{do +7( a ii a 2;/3i)} is equal to the g-formula f yf* a ^ ( y)dy. 
Now, given the joint density of the data f(A \, L, A 2 , Y), define 

f(AuL,A 2 ,Y) = f(X | A 1 ,L,A 2 )MA 2 )f(L \ AJMAJ, 

where fi(Ai)f 2 (A 2 ) are user-supplied densities chosen so that 
/ is absolutely continuous with respect to /. Since the g- 
formula depends on the joint density of the data only through 
f(Y | Ai, L, A2) and f(L \ Ai), then it is identical under / and 
under /. Furthermore, for each m, 02 the g-formula under / 
is just equal to E[Y \ A\ = ai, A 2 = 02 ] since, under /, A2 is 
independent of {L,A\}. Consequently, for any q{A\, A2) 

0 = E[q(A 1: A 2 )(Y - 4{/3o + 7 (^ 1 , A 2 ; A*)})] 

= E[q(A 1 ,A2){f(A 1 )f(A2)/{f(A 1 )f(A a | A^L)}} 

•(y-4.{^+ 7 (A 1 ,A 2 ;/3 1 *)})], 

where the second equality follows from the Radon-Nikodym 
theorem. The result then follows by taking q{A\,A 2 ) = 

v(A 1 ,A 2 )/{f(A 1 )f(A 2 )}. 
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ing a causal effect of A 2 in that stratum; estima¬ 
tion of an SNM model would, but estimation of an 
MSM model would not, detect this effect. A second 
drawback is that estimation of MSM models, suffers 
from marked instability and finite-sample bias in the 
presence of weights W that are highly variable and 
skewed. This is not generally an issue in SNM es¬ 
timation. A third limitation of MSMs is that when 
( 6 ) fails but an instrumental variable is available, 
one can still consistently estimate the parameters of 
a SNM but not of an MSM . 28 

An advantage of MSMs over SNMs that was not 
discussed in Section 2.8 is the following. MSMs can 
be constructed that are indexed by easily inter¬ 
pretable parameters that quantify the overall effects 
of a subset of all possible dynamic regimes (Hernan 
et al. (2006); van der Laan and Petersen (2007); 
Orellana, Rotnitzky and Robins (2010a, 2010b). As 
an example consider a longitudinal study of HIV 
infected patients with baseline CD4 counts exceed¬ 
ing 600 in which we wish to determine the optimal 
CD4 count at which to begin anti-retroviral treat¬ 
ment. Let g x denote the dynamic regime that speci¬ 
fies treatment is to be initiated the first time a sub¬ 
ject’s CD4 count falls below x, x € {1,2,..., 600}. 
Let Y{g x ) be the associated counterfactual response 
and suppose few study subjects follow any given 
regime. If we assume E[Y(g x )\ varies smoothly with 
x, we can specify and fit (by IPTW) a dynamic 
regime MSM model E[Y{g x )\ = /3q + / 3* T h(x ) where, 
say, h{x) is a vector of appropriate spline functions. 

3. DIRECT EFFECTS 

Robins’ analysis of sequential regimes leads imme¬ 
diately to the consideration of direct effects. Thus, 
perhaps not surprisingly, all three of the distinct di¬ 
rect effect concepts that are now an integral part 
of the causal literature are all to be found in his 
early papers. Intuitively, all the notions of direct ef¬ 
fect consider whether “the outcome (V) would have 
been different had cause (Ai) been different, but the 
level of (A 2 ) remained unchanged.” The notions dif¬ 
fer regarding the precise meaning of A 2 “remained 
unchanged.” 

3.1 Controlled Direct Effects 

In a setting in which there are temporally or¬ 
dered treatments A\ and A 2 , it is natural to wonder 


28 Note that, as observed earlier, in this case identification is 
achieved through parametric assumptions made by the SNM. 


whether the first treatment has any effect on the 
final outcome were everyone to receive the second 
treatment. Formally, we wish to compare the poten¬ 
tial outcomes Y(a± = l,a 2 = 1) and Y(a± = 0,a 2 = 
1). Robins (1986, Section 8) considered such con¬ 
trasts, that are now referred to as controlled direct 
effects. More generally, the average controlled direct 
effect of A\ on Y when A 2 is set to 02 is defined to 
be 

(15) CDE(a 2 ) = E[Y (ai = 1, a 2 ) — Y{a\ = 0, a 2 )], 

where Y(a\ = 1, a 2 ) — Y{a\ = 0, a 2 ) is the individual 
level direct effect. Thus, if A 2 takes fc-levels then 
there are k such contrasts. 

Under the causal graph shown in Figure 5(a), in 
contrast to Figures 2 and 3, the effect of A 2 on Y is 
unconfounded, by either measured or unmeasured 
variables, association is causation and thus, under 
the associated FFRCISTG model: 

CDE(a 2 ) = E[Y | Ai = 1, A 2 = a 2 ] 

— E[Y | Ai = 0, A 2 = a 2 ]. 

The CDE can be identified even in the presence 
of time-dependent confounding. For example, in the 
context of the FFRCISTG associated with either 
of the causal DAGs shown in Figures 2 and 3, 
the CDE(a 2 ) will be identified via the difference in 
the expectations of Y under the g-formula densities 

fa!=l,a 2 (y) and /a 1= 0,a 2 (y)- 29 

The CDE requires that the potential outcomes 
V(ai,a 2 ) be well-defined for all values of a\ and a 2 . 
This is because the CDE treats both A 2 and A\ as 
causes, and interprets “A 2 remained unchanged” to 
mean “had there been an intervention on A 2 fixing 
it to o 2 .” 

This clearly requires that the analyst be able to 
describe a well-defined intervention on the mediat¬ 
ing variable A 2 . 

There are many contexts in which there is no clear 
well-defined intervention on A 2 and thus it is not 
meaningful to refer to V(ai,a 2 ). The CDE is not 
applicable in such contexts. 

3.2 Principal Stratum Direct Effects (PSDE) 

Robins (1986) considered causal contrasts in the 
situation described in Section 2.9 in which death 
from a disease of interest, for example, a heart at¬ 
tack, may be censored by death from other diseases. 


29 See (7). 




14 


T. S. RICHARDSON AND A. ROTNITZKY 




Fig. 5. (a) A causal DAG Q with no (measured or unmeasured) confounding of A 2 on Y; (b) the SWIG 0 ( 01 , 02 ) resulting 

from intervening on Ai and A 2 . 


To describe these contrasts, we suppose A\ is a 
treatment of interest, Y = 1 is the indicator of death 
from the disease of interest (in a short interval sub¬ 
sequent to a given fixed time t) and A 2 = 0 is the “at 
risk indicator” denoting the absence of death either 
from other diseases or the disease of interest prior 
to time t. 

Earlier Kalbfleisch and Prentice (1980) had ar¬ 
gued that if A 2 = 1 , so that the subject does not 
survive to time t, then the question of whether the 
subject would have died of heart disease subsequent 
to t had death before t been prevented is meaning¬ 
less. In the language of counterfactuals, they were 
saying (i) that if A\ = ai and A 2 = A 2 (a\) = 1, the 
counterfactual Y (ai, 02 = 0) is not well-defined and 
(ii) the counterfactual Y (a\,a 2 = 1) is never well- 
defined. 

Robins (1986, Section 12.2) observed that if one 
accepts this then the only direct effect contrast that 
is well-defined is Y (ai = 1,02 = 0 ) — Y(a\ = 0,02 = 
0 ) and that is well-defined only for those subjects 
who would survive to t regardless of whether they re¬ 
ceived ai = 0 or ai = 1. In other words, even though 
E(< 21 , 02 ) may not be well-defined for all subjects 
and all a\, 02 , the contrast: 

E[Y (01 = 0, o 2 ) — Y(ai = 1, o 2 ) | 

(16) 

^ 2(01 = 1) = A 2 {a\ = 0 ) = a 2 ] 

is still well-defined when <22 = 0. As noted by Robins, 
this could provide a solution to the problem of defin¬ 
ing the causal effect of the treatment A\ on the out¬ 
come Y in the context of censoring by death due to 
other diseases. 

Rubin (1998) and Frangakis and Rubin (1999, 
2002 ) later used this same contrast to solve precisely 
the same problem of “censoring by death .” 30 


30 The analysis of Rubin (2004) was also based on this con¬ 
trast, with A 2 no longer a failure time indicator so that the 
contrast (16) could be considered as well-defined for any value 
of a 2 for which the conditioning event had positive probabil¬ 
ity. 


In the terminology of Frangakis and Rubin (2002) 
for a subject with A 2 (a\ = 1) = A 2 {a\ = 0) = a 2 , the 
individual principal stratum direct effect is defined 
to be : 31 

Y (ai = 1, a 2 ) -Y(a x = 0, a 2 ) 

(here, A\ is assumed to be binary). The average 
PSDE in principal stratum <22 is then defined to be 

PSDE(a 2 ) = E[Y(a 1 = 1 , a 2 ) - Y(oi = 0 , a 2 ) | 

A 2 (ai = 1) = A 2 (ai = 0 ) = a 2 \ 
(17) 

= E[Y (ai = 1) — Y(ai = 0) | 

A2 (®i = 1 ) = 21.2(01 = 0 ) = <22], 

where the second equality here follows, since Y{a\, 
A 2 {a{)) = Y(ai ). 32 In contrast to the CDE, the 
PSDE has the advantage that it may be defined, via 
(17), without reference to potential outcomes involv¬ 
ing intervention on < 22 - Whereas the CDE views A 2 
as a treatment, the PSDE treats A 2 as a response. 
Equivalently, this contrast interprets “had A 2 re¬ 
mained unchanged” to mean “we restrict attention 
to those people whose value of A 2 would still have 
been a 2 , even under an intervention that set Ai to 
a different value.” 

Although the PSDE is an interesting parame¬ 
ter in many settings (Gilbert, Bosch and Hudgens 
(2003)), it has drawbacks beyond the obvious (but 
perhaps less important) ones that neither the pa¬ 
rameter itself nor the subgroup conditioned on are 
nonparametrically identified. In fact, having just 
defined the PSDE parameter, Robins (1986) crit¬ 
icized it for its lack of transitivity when there is 
a non-null direct effect of A\ and A\ has more 
than two levels; that is, for a given a 2 , the PS- 
DEs comparing 01 = 0 with a\ = 1 and a\ = 1 


31 For subjects for whom ARau = 1) ^ ^ 2 ( 0:1 = 0), no prin¬ 
cipal stratum direct effect (PSDE) is defined. 

32 This follows from consistency. 
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with ai = 2 may both be positive but the PSDE 
comparing a\ = 0 with a\ = 2 may be negative. 
Robins, Rotnitzky and Vansteelandt (2007) noted 
that the PSDE is undefined when Ai has an effect 
on every subject’s A 2 , a situation that can easily 
occur if A 2 is continuous. In that event, a natural 
strategy would be to, say, dichotomize A 2 . However, 
Robins, Rotnitzky and Vansteelandt (2007) showed 
that the PSDE in principal stratum a\ of the di¬ 
chotomized variable may fail to retain any mean¬ 
ingful substantive interpretation. 

3.3 Pure Direct Effects (PDE) 33 

Once it has been established that a treatment A\ 
has a causal effect on a response Y, it is natural to 
ask what “fraction” of a the total effect may be at¬ 
tributed to a given causal pathway. As an example, 
consider a RCT in nonhypertensive smokers of the 
effect of an anti-smoking intervention (Ai) on the 
outcome myocardial infarction (MI) at 2 years (Y). 
For simplicity, assume everyone in the intervention 
arm and no one in the placebo arm quit cigarettes, 
that all subjects were tested for new-onset hyper¬ 
tension A 2 at the end of the first year, and no sub¬ 
ject suffered an MI in the first year. Hence, A\, A 2 
and Y occur in that order. Suppose the trial showed 
smoking cessation had a beneficial effect on both 
hypertension and MI. It is natural to consider the 
query: “What fraction of the total effect of smoking 
cessation A\ on MI Y is through a pathway that 
does not involve hypertension A 2 ?” 

Robins and Greenland (1992) formalized this 
question via the following counterfactual contrast, 
which they termed the “pure direct effect”: 

Y{ai = 1, A 2 (ai = 0)} - Y{ai = 0, A 2 (oi = 0)}. 

The second term here is simply Y(a± = 0). 3i The 
contrast is thus the difference between two quanti¬ 
ties: first, the outcome Y that would result if we 
set a\ to 1, while “holding fixed” a 2 at the value 
A 2 (ai = 0) that it would have taken had a\ been 
0; second, the outcome Y that would result from 
simply setting a\ to 0 [and thus having A 2 again 
take the value A 2 (ai = 0)]. Thus, the Pure Direct 
Effect interprets had “A 2 remained unchanged” to 
mean “had (somehow) A 2 taken the value that it 


33 Pearl (2001) adopted the definition given by Robins and 
Greenland (1992) but changed nomenclature. He refers to the 
pure direct effect as a “natural” direct effect. 

34 This follows by consistency. 


would have taken had we fixed A\ to 0.” The con¬ 
trast thus represents the effect of A\ on Y had the 
effect of A\ on hypertension A 2 been blocked. As 
for the CDE, to be well-defined, potential outcomes 
V(ai,a 2 ) must be well-defined. As a summary mea¬ 
sure of the direct effect of (a binary variable) A\ on 
Y, the PDE has the advantage (relative to the CDE 
and PSDE) that it is a single number. 

The average pure direct effect is defined as 35 

PDE = E[Y {ai = 1, A 2 (ai = 0)}] 

- E[Y (ai = 0, A 2 (ai = 0))]. 

Thus, the ratio of the PDE to the total effect 
E[Y{ai = 1}] — E[Y{ai = 0}] is the fraction of the 
total that is through a pathway that does not involve 
hypertension (A 2 ). 

Unlike the PSDE, the PDE is an average over 
the full population. However, unlike the CDE, the 
PDE is not nonparametrically identified under the 
FFRCISTG model associated with the simple DAG 
shown in Figure 5(a). Robins and Richardson (2011, 
App. C) computed bounds for the PDE under the 
FFRCISTG associated with this DAG. 

Pearl (2001) obtains identification of the PDE un¬ 
der the DAG in Figure 5(a) by imposing stronger 
counterfactual independence assumptions, via a 
Nonparametric Structural Equation Model with In¬ 
dependent Errors (NPSEM-IE). 36 Under these as¬ 
sumptions, Pearl (2001) obtains the following iden¬ 
tifying formula: 

J2{E[Y\A 1 = l,A 2 = a 2 } 

a 2 


35 Robins and Greenland (1992) also defined the total indi¬ 
rect effect (TIE) of Ai on Y through A 2 to be 

E[Y{a 1 = 1, A a (ai = 1)}] - E[Y{a 1 = 1, A 2 (m = 0)}]. 

It follows that the total effect E[Y{ai = 1}] — E[Y{ai = 0}] 
can then be decomposed as the sum of the PDE and the TIE. 

36 In more detail, the FFRCISTG associated with Fig¬ 
ures 5(a) and (b) assumes for all ai, a 2 , 

(18) Y(ai,a 2 ),A 2 (ai) AL Ai, Y(ai,a 2 ) AL A 2 (ai)\ A\, 

which may be read directly from the SWIG shown in Fig¬ 
ure 5(b); recall that red nodes are always blocked when apply¬ 
ing d-separation. In contrast, Pearl’s NPSEM-IE also implies 
the independence 

(19) Y(ai,a 2 ) ALA 2 (at) \ Ai, 

when a\ A a i- Independence (19), which is needed in order 
for the PDE to be identified, is a “cross-world” independence 
since Y(ai,a 2 ) and A 2 (al) could never (even in principle) 
both be observed in any randomized experiment. 




16 


T. S. RICHARDSON AND A. ROTNITZKY 


(20) — E\Y | A\ — 0, A 2 — 02]} 

• P(A 2 = a 2 | A\ = 0), 

which he calls the “Mediation Formula.” 

Robins and Richardson (2011) noted that the 
additional assumptions made by the NPSEM-IE 
are not testable, even in principle, via a random¬ 
ized experiment. Consequently, this formula rep¬ 
resents a departure from the principle, originating 
with Neyman (1923), that causation be reducible 
to experimental interventions, often expressed in 
the slogan “no causation without manipulation.” 37 
Robins and Richardson (2011) achieve a rapproche¬ 
ment between these opposing positions by showing 
that the formula (20) is equal to the g-formula asso¬ 
ciated with an intervention on two treatment vari¬ 
ables not appearing on the graph (but having de¬ 
terministic relations with Ai ) under the assumption 
that one of the variables has no direct effect on A 2 
and the other has no direct effect on Y. Hence, under 
this assumption and in the absence of confounding, 
the effect of this intervention on Y is point identified 
by (20). 38 

Although there was a literature on direct ef¬ 
fects in linear structural equation models (see, e.g., 
Blalock (1971)) that preceded Robins (1986) and 
Robins and Greenland (1992), the distinction be¬ 
tween the CDE and PDE did not arise since in linear 
models these notions are equivalent. 39 


37 A point freely acknowledged by Pearl (2012) who argues 
that causation should be viewed as more primitive than in¬ 
tervention. 

38 This point identification is not a “free lunch”: 
Robins and Richardson (2011) show that it is these additional 
assumptions that have reduced the FFRCISTG bounds for 
the PDE to a point. This is a consequence of the fact that 
these assumptions induce a model for the original variables 
{Ai, Ai{ai), Y(ai, a, 2 )} that is a strict submodel of the origi¬ 
nal FFRCISTG model. 

Hence to justify applying the mediation formula by this 
route one must first be able to specify in detail the additional 
treatment variables and the associated intervention so as to 
make the relevant potential outcomes well-defined. In addi¬ 
tion, one must be able to argue on substantive grounds for 
the plausibility of the required no direct effect assumptions 
and deterministic relations. 

It should also be noted that even under Pearl’s NPSEM- 
IE model the PDE is not identified in causal graphs, such 
as those in Figures 2 and 3 that contain a variable (whether 
observed or unobserved) that is present both on a directed 
pathway from Ai to Ai and on a pathway from Ai to Y. 

39 Note that in a linear structural equation model the PSDE 
is not defined unless Ai has no effect on Ai. 


3.4 The Direct Effect Null 

Robins (1986, Section 8) considered the null hy¬ 
pothesis that Y(ai,a 2 ) does not depend on a\ for 
all a 2 , which we term the sharp null-hypothesis of 
no direct effect of A\ on Y (relative to A 2 ) or more 
simply as the “sharp direct effect null.” 

In the context of our running example with data 
(Ai,L, A 2 ,Y), under (6) the sharp direct effect null 
implies the following constraint on the observed data 
distribution: 

(21) /* (y) is not a function of a\ for all a 2 . 

Robins (1986, Sections 8 and 9) noted that this con¬ 
straint (21) is not a conditional independence. This 
is in contrast to the y-null hypothesis which we have 
seen is equivalent to the independencies in (ii) of 
Section 2.6 [when equation (9) holds for all g £ O]. 40 
He concluded that, in contrast to the g-null hypoth¬ 
esis, the constraint (21), and thus the sharp direct 
effect null, cannot be tested using case control data 
with unknown case and control sampling fractions. 41 
This constraint (21) was later independently discov¬ 
ered by Verma and Pearl (1990) and for this reason 
is called the “Verma constraint” in the Computer 
Science literature. 

Robins (1999b) noted that, though (21) is not a 
conditional independence in the observed data dis¬ 
tribution, it does correspond to a conditional in¬ 
dependence, but in a weighted distribution with 
weights proportional to l/f(A 2 \ A\,L). This can 
be understood from the informal discussion follow¬ 
ing equation (14) in the previous section: there it 
was noted that given the FFRCISTG corresponding 
to the DAG in Figure 2, reweighting by l/f(A 2 \ 
A\,L) corresponds to removing both edges into A 2 . 
Hence, if the edges A\ —>• Y and L —>• Y are not 
present, so that the sharp direct effect null holds, 
as in Figure 6(a), then the reweighted population is 


40 Results in Pearl (1995b) imply that under the sharp di¬ 
rect effect null the FFRCISTGs associated with the DAGs 
shown in Figures 2 and 3 also imply inequality restrictions 
similar to Boll’s inequality in Quantum Mechanics. See Gill 
(2014) for discussion of statistical issues arising from experi¬ 
mental tests of Bell’s inequality. 

41 To our knowledge, it is the first such causal null hypoth¬ 
esis considered in Epidemiology for which this is the case. 

42 This observation motivated the development of graphical 
“nested” Markov models that encode constraints such as (21) 
in addition to ordinary conditional independence relations; 
see the discussion of “Causal Discovery” in Section 7 below. 
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Fig. 6. (a) A DAG representing the sequentially randomized experiment shown in Figure 2 but where there is no direct effect 

of A\ on Y relative to A 2 ; (b) a DAG representing the pseudo-population obtained by re-weighting the distribution with weights 
proportional to l/f(A 2 \ L,Af). 


described by the DAG in Figure 6 (b). It then fol¬ 
lows from the d-separation relations on this DAG 
that Y _1L A\ | A 2 in the reweighted distribution. 

This fact can also be seen as follows. If, in our 
running example from Section 2 . 2 , A\, A 2 , Y are 
all binary, the sharp direct effect null implies that 
/?i* = /? 3 *=0 in the saturated MSM with 

^~ 1 {E[Y(a 1 ,a 2 )}} = A) + A*«i + P 2 a 2 + $a,a 2 . 

Since Al and /3| are the associational parameters 
of the weighted distribution, their being zero im¬ 
plies the conditional independence Y _1L A\ \ A 2 un¬ 
der this weighted distribution. 

In more complex longitudinal settings, with the 
number of treatment times k exceeding 2 , all the 
parameters multiplying terms containing a particu¬ 
lar treatment variable in a MSM may be zero, yet 
there may still be evidence in the data that the sharp 
direct effect null for that variable is false. This is di¬ 
rectly analogous to the limitation of MSMs relative 
to SNMs with regard to the sharp null hypothesis 
( 10 ) of no effect of any treatment that we noted at 
the end of Section 2.10. To overcome this problem, 
Robins (1999b) introduced direct effect structural 
nested models. In these models, which involve treat¬ 
ment at k time points, if all parameters multiplying 
a given a 3 take the value 0 , then we can conclude 
that the distribution of the observables do not refute 
the natural extension of (21) to k times. The latter 
is implied by the sharp direct effect null that aj has 
no direct effect on Y holding cij + \ ,..., a& fixed. 

4. THE FOUNDATIONS OF STATISTICS AND 
BAYESIAN INFERENCE 

Robins and Ritov (1997) and Robins and Wasser- 
man ( 2000 ) recognized that the lack of robustness 
of estimators based on the g-formula in a sequen¬ 
tial randomized trial with known randomization 
probabilities had implications for the foundations 


of statistics and for Bayesian inference. To make 
their argument transparent, we will assume in our 
running example (from Section 2.2) that the den¬ 
sity of L is known and that A\ = 1 with probabil¬ 
ity 1 (hence we drop Ai from the notation). We 
will further assume the observed data are n i.i.d. 
copies of a random vector (L,A 2 ,Y) with A 2 and 
Y binary and L a d x 1 continuous vector with 
support on the unit cube (0, l) d . We consider a 
model for the law of ( L,A 2 ,Y ) that assumes that 
the density f*(l) of L is known, that the treat¬ 
ment probability ir*(l) = Pr(A .2 = 1 | L = l) lies in 
the interval (c, 1 — c) for some known c > 0 and that 
b*(l,a 2 ) = E[Y | L = l,A 2 = a 2 ] is continuous in l. 
Under this model, the likelihood function is 

( 22 ) C(b,7T)=C 1 (b)C 2 (7T), 

where 

n 

Ci(b) = Hf*(L i )b(L i ,A 2 , i ) Y 

(23) 

•{1 -KL^A^)} 1 ^ , 

n 

(24) C 2 { vr) = H7r 2 (L i ) A *d{ 1 _ 

1=1 

and ( 6 , n) € B x II. Here B is the set of continuous 
functions from ( 0 , l) d x { 0 , 1 } to ( 0 , 1 ) and II is the 
set of functions from ( 0 , l) d to (c, 1 — c). 

We assume the goal is inference about //( 6 ) where 
fi(b) = f b(l , 1 )f*(l) dl. Under randomization, that is 
(3) and (4), fi(b*) is the counterfactual mean of Y 
when treatment is given at both times. 

When 7 T* is unknown, Robins and Ritov (1997) 
showed that no estimator of //(&*) exists that is uni¬ 
formly consistent over all B x II. They also showed 
that even if 7 r* is known, any estimator that does not 
use knowledge of n* cannot be uniformly consistent 
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over B x {vr*} for all it*. However, there do exist es¬ 
timators that depend on n* that are uniformly y/n- 
consistent for //(&*) over B x {7r*} for all 7 r*. The 
Horvitz-Thompson estimator ~P n {A 2 Y/ it* (L)} is a 
simple example. 

Robins and Ritov (1997) concluded that, in this 
example, any method of estimation that obeys the 
likelihood principle such as maximum likelihood or 
Bayesian estimation with independent priors on b 
and 7T, must fail to be uniformly consistent. This 
is because any procedure that obeys the likelihood 
principle must result in the same inference for fi(b*) 
regardless of 7 r*, even when 7 r* becomes known. 
Robins and Wasserman (2000) noted that this ex¬ 
ample illustrates that the likelihood principle and 
frequentist performance can be in severe conflict in 
that any procedure with good frequentist properties 
must violate the likelihood principle . 43 Ritov et al. 
(2014) in this volume extends this discussion in 
many directions. 

5. SEMIPARAMETRIC EFFICIENCY AND 
DOUBLE ROBUSTNESS IN MISSING DATA 
AND CAUSAL INFERENCE MODELS 

Robins and Rotnitzky (1992) recognized that the 
inferential problem of estimation of the mean E[Y ( 5 )] 
(when identified by the g-formula) of a response Y 
under a regime g is a special case of the general 
problem of estimating the parameters of an arbi¬ 
trary semi-parametric model in the presence of data 
that had been coarsened at random (Heitjan and 
Rubin (1991)). 44 


43 In response Robins (2004, Section 5.2) offered a Bayes- 
frequentist compromise that combines honest subjective 
Bayesian decision making under uncertainty with good fre¬ 
quentist behavior even when, as above, the model is so large 
and the likelihood function so complex that standard (un¬ 
compromised) Bayes procedures have poor frequentist per¬ 
formance. The key to the compromise is that the Bayesian 
decision maker is only allowed to observe a specified vector 
function of X [depending on the known 7r*(X)] but not X 
itself. 

44 Given complete data X, an always observed coarsening 
variable R, and a known coarsening function X( r ) = c(r,x), 
coarsening at random (CAR) is said to hold if Pr (R = r \ 
A') depends only on A'( r ), the observed data part of X. 
Robins and Rotnitzky (1992), Gill, van der Laan and Robins 
(1997) and Cator (2004) showed that in certain models as¬ 
suming CAR places no restrictions on the distribution of the 
observed data. For such models, we can pretend CAR holds 
when our goal is estimation of functionals of the observed 
data distribution. This trick often helps to derive efficient es- 


This viewpoint led them to recognize that the 
IPCW and IPTW estimators described earlier were 
not fully efficient. To obtain efficient estimators, 
Robins and Rotnitzky (1992) and Robins, Rot¬ 
nitzky and Zhao (1994) used the theory of semi- 
parametric efficiency bounds (Bickel et al. (1993); 
van der Vaart (1991)) to derive representations for 
the efficient score, the efficient influence function, 
the semiparametric variance bound, and the influ¬ 
ence function of any asymptotically linear estima¬ 
tor in this general problem. The books by Tsiatis 
(2006) and by van der Laan and Robins (2003) pro¬ 
vide thorough treatments. The generality of these 
results allowed Robins and his principal collabora¬ 
tors Mark van der Laan and Andrea Rotnitzky to 
solve many open problems in the analysis of semi¬ 
parametric models. For example, they used the ef¬ 
ficient score representation theorem to derive lo¬ 
cally efficient semiparametric estimators in many 
models of importance in biostatistics. Some exam¬ 
ples include conditional mean models with miss¬ 
ing regressors and/or responses (Robins, Rotnitzky 
and Zhao (1994); Rotnitzky and Robins (1995)), bi¬ 
variate survival (Quale, van der Laan and Robins 
(2006)) and multivariate survival models with ex¬ 
plainable dependent censoring (van der Laan, Hub¬ 
bard and Robins (2002 )). 45 


timators of the functional. In this section, we assume that 
the distribution of the observables is compatible with CAR, 
and further, that in the estimation problems that we consider, 
CAR may be assumed to hold without loss of generality. 

In fact, this is the case in the context of our run¬ 
ning causal inference example from Section 2.2. Specifi¬ 
cally, let X — {y(ai, 02 ),T(ai);aj G {0,1}, j = 1,2}, R = 
(Ai,A 2 ), and X (ai>a2) = {Y(ai, 02 ), L(ai)}. Consider a 
model Mx for X that specifies (i) {l'(l, a 2 ), L(l); 02 € 
{0,1}} JL {Y (0,a 2 ),L(0);a 2 € {0,1}} and (ii) Y(ai,l) JL 
Y(ai,0) | L(a\) for ai G {0,1}. Results in Gill and Robins 
(2001, Section 6) and Robins (2000, Sections 2.1 and 4.2) 
show that (a) model Mx places no further restrictions 
on the distribution of the observed data (Ai, A 2 , L,Y) = 
(Ai,A 2 ,L(Ai),Y(Ai,A 2 )), (b) given model Mx, the addi¬ 
tional independences A JL Ai and X JL A 2 \ A \, L together 
also place no further restrictions on the distribution of the 
observed data (Ai, A 2 , L,Y) and are equivalent to assuming 
CAR. Further, the independences in (b) imply (9) so that 
fY(g){y) is identified by the g-formula f*(y). 

45 More recently, in the context of a RCT, Tsiatis et al. 
(2008) and Moore and van der Laan (2009), following the 
strategy of Robins and Rotnitzky (1992), studied variants of 
the locally efficient tests and estimators of Scharfstein, Rot¬ 
nitzky and Robins (1999) to increase efficiency and power by 
utilizing data on covariates. 
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In coarsened at random data models, whether 
missing data or causal inference models, locally ef¬ 
ficient semiparametric estimators are also doubly 
robust (Scharfstein, Rotnitzky and Robins (1999), 
pages 1141-1144) and (Robins and Rotnitzky (2001)) 
See the book (van der Laan and Robins (2003)) 
for details and for many examples of doubly ro¬ 
bust estimators. Doubly robust estimators had 
been discovered earlier in special cases. In fact, 
Firth and Bennett (1998) note that the so-called 
model-assisted regression estimator of a finite popu¬ 
lation mean of Cassel, Sarndal and Wretman (1976) 
is design consistent which is tantamount to being 
doubly robust. See Robins and Rotnitzky (2001) for 
other precursors. 

In the context of our running example, from Sec¬ 
tion 2.2, suppose (6) holds. An estimator fidr of 
h = E[Y(a 1 ,a 2 )\ = /* 1>a2 (l) for, say ai = a 2 = 1, is 
said to be doubly robust (DR) if it is consistent when 
either (i) a model for i t(L) = Pr(A 2 = 1 | A\ = 1,L) 
or (ii) a model for b(L ) = E[Y \ A\ = 1, L, A 2 = 1] is 
correct. When L is high dimensional and, as in an 
observational study, 7r(-) is unknown, double robust¬ 
ness is a desirable property because model misspec- 
ification is generally unavoidable, even when we use 
flexible, high dimensional, semiparametric models in 
(i) and (ii). In fact, DR estimators have advantages 
even when, as is usually the case, the models in (i) 
and (ii) are both incorrect. This happens because 
the bias of the DR estimator fid r is of second order, 
and thus generally less than the bias of a non-DR 
estimator (such as a standard IPTW estimator). By 
second order, we mean that the bias of fid r depends 
on the product of the error made in the estimation 
of Pr(A 2 = 1 | A\ = 1, L) times the error made in the 
estimation of E[Y \ A\ = \,L 1 A 2 = 1]. 

Scharfstein, Rotnitzky and Robins (1999) noted 
that the locally efficient estimator of Robins, Rot¬ 
nitzky and Zhao (1994) 

fidr = {PnlAi ]} -1 


lHm Y 


A 2 

*(L) 


- 1 



is doubly robust where n(L) and b(L) are estimators 
of vr(L) and b{L). Unfortunately, in finite samples 
this estimator may fail to lie in the parameter space 
for ji, that is, the interval [0,1] if Y is binary. In 
response, Scharfstein, Rotnitzky and Robins (1999) 
proposed a plug-in DR estimator, the doubly robust 
regression estimator 

fidr,reg = {^n^l ]} _1 F„{Ai6(L)}, 


where now b{L ) = expit{m(L; r)) + 6/n(L)} and 
(rj, 9) are obtained by fitting by maximum likelihood 
the logistic regression model Pr(y = 1 | A\ = 1, 
L,A 2 = 1) = expit{m(L; rj) + 8/tt(L)} to subjects 
with A\ = 1 , A 2 = 1. Here, m(L; rj) is a user-specified 
function of L and of the Euclidean parameter rj. 

Robins (1999a) and Bang and Robins (2005) ob¬ 
tained plug-in DR regression estimators in longitu¬ 
dinal missing data and causal inference models by 
reexpressing the g-formula as a sequence of iterated 
conditional expectations. 

van der Laan and Rubin (2006) proposed a clever 
general method for obtaining plug-in DR estima¬ 
tors called targeted maximum likelihood. In our 
setting, the method yields an estimator /2dr,TMLE 
that differs from fidr,reg only in that b(L) is now 
given by expit{m(L) + d greedy /7r(L)} where 9 gieedy 
is again obtained by maximum likelihood but with 
a fixed offset rh(L). This offset is an estimator of 
Pr(T = 1 | A\ = l,L,A 2 = 1) that might be ob¬ 
tained using ffexible machine learning methods. 
Similar comments apply to models considered by 
Bang and Robins (2005). Since 2006 there has been 
an explosion of research that has produced dou¬ 
bly robust estimators with much improved large 
sample efficiency and finite sample performance; 
Rotnitzky and Vansteelandt (2014) give a review. 

We note that CAR models are not the only models 
that admit doubly robust estimators. For example, 
Scharfstein, Rotnitzky and Robins (1999) exhibited 
doubly robust estimators in models with nonignor- 
able missingness. Robins and Rotnitzky (2001) de¬ 
rived sufficient conditions, satisfied by many non- 
CAR models, that imply the existence of doubly ro¬ 
bust estimators. Recently, doubly robust estimators 
have been obtained in a wide variety of models. See 
Dudik et al. (2014) in this volume for an interesting 
example. 

6. HIGHER ORDER INFLUENCE FUNCTIONS 

It may happen that the second-order bias of a 
doubly-robust estimator fid r decreases slower to 0 
with n than n -1 / 2 , and thus the bias exceeds the 
standard error of the estimator. In that case, con¬ 
fidence intervals for /i based on fid r fail to cover 
at their nominal rate even in large samples. Fur¬ 
thermore, in such a case, in terms of mean squared 
error, fid r does not optimally trade off bias and 
variance. In an attempt to address these problems, 
Robins et al. (2008) developed a theory of point and 
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interval estimation based on higher order influence 
functions and use this theory to construct estima¬ 
tors of /,t that improve on /2dr- Higher order in¬ 
fluence functions are higher order U-statistics. The 
theory of Robins et al. (2008) extends to higher or¬ 
der the first order semiparametric inference theory 
of Bickel et al. (1993) and van der Vaart (1991). In 
this issue, van der Vaart (2014) gives a masterful re¬ 
view of this theory. Here, we present an interesting 
result found in Robins et al. (2008) that can be un¬ 
derstood in isolation from the general theory and 
conclude with an open estimation problem. 

Robins et al. (2008) consider the question of 
whether, for estimation of a conditional variance, 
random regressors provide for faster rates of con¬ 
vergence than do fixed regressors, and, if so, how? 
They consider a setting in which n i.i.d. copies of 
( Y , X ) are observed with X a d-dimensional random 
vector, with bounded density /(■) absolutely con¬ 
tinuous w.r.t. the uniform measure on the unit cube 
(0, l) d . The regression function b(-) = E[Y \ X = ■] 
is assumed to lie in a given Holder ball with Holder 
exponent /3 < l. 46 The goal is to estimate E\Vai{Y \ 
V}] under the homoscedastic semiparametric model 
Var[V | X] = a 2 . Under this model, the authors con¬ 
struct a simple estimator a 2 that converges at rate 
n -(4d/ d )/( 1 + 4 / 3 / rf ) ; when /3/d < 1/4. 

Wang et al. (2008) and Cai, Levine and Wang 
(2009) earlier proved that if Xi,i = 1,... ,n, are 
nonrandom but equally spaced in ( 0 , l) d , the min¬ 
imax rate of convergence for the estimation of a 2 
is (when /3/d < 1/4) which is slower than 

n -(4/3/d)/(i+4/3/d). Thus, randomness in X allows for 
improved convergence rates even though no smooth¬ 
ness assumptions are made regarding /(•). 

To explain how this happens, we describe the 
estimator of Robins et al. (2008). The unit cube 
in M. d is divided into k = k(n ) = n 7 , 7 > 1 identi¬ 
cal subcubes each with edge length k ~ x ^ d . A sim¬ 
ple probability calculation shows that the number 
of subcubes containing at least two observations 
is O p (n 2 /k). One may estimate o 2 in each such 
subcube by (V — l/) 2 /2 . 47 An estimator d 2 of a 2 
may then be constructed by simply averaging the 


46 A function fe(-) lies in the Holder ball H (/?, C) with Holder 
exponent /3 > 0 and radius C > 0, if and only if b(-) is bounded 
in supremum norm by C and all partial derivatives of b[x) up 
to order |_/3J exist, and all partial derivatives of order [_/3J are 
Lipschitz with exponent (/3 — |_/3J) and constant C. 

47 If a subcube contains more than two observations, two 
are selected randomly, without replacement. 


subcube-specific estimates (Y) — Y/y ) 2 /2 over all the 
sub-cubes with at least two observations. The rate 
of convergence of the estimator is maximized at 
n -(4/9/d)/(i+40/d) by taking k = n 2 /(^+^/d) « 

Robins et al. (2008) conclude that the random de¬ 
sign estimator has better bias control, and hence 
converges faster than the optimal equal-spaced 
fixed X estimator, because the random design es¬ 
timator exploits the O p (n 2 /n 2 ^ 1+ ‘ i/ ^ d ' > ) random 
fluctuations for which the A’s corresponding to 
two different observations are only a distance of 
0({n 2 /( 1+4 ^/ d ^}~ 1 / d ) apart. 

An Open Problem 49 

Consider again the above setting with random X. 
Suppose that (3/d remains less than 1/4 but now 
/3 > 1. Does there still exist an estimator of er 2 that 
converges at n _ l 4 ^/ d )/(i+4/3/d)? Analogy with other 
nonparametric estimation problems would suggest 
the answer is “yes,” but the question remains un¬ 
solved . 50 

7. OTHER WORK 

The available space precludes a complete treat¬ 
ment of all of the topics that Robins has worked on. 
We provide a brief description of selected additional 
topics and a guide to the literature. 

Analyzing Observational Studies as Nested 
Randomized Trials 

Hernan et al. (2008) and Hernan, Robins and 
Garcia Rodriguez (2005) conceptualize and ana¬ 
lyze observational studies of a time varying treat- 


48 Observe that E[{Yi - Yj) 2 /2 \ Xi,Xj\ = a 2 + {b(Xi) - 
b(Xj)} 2 /2, | b(Xi) - b{Xj)\ = 0(\\Xi - Xjf) as < 1, and 
|| Xi — Xj || = d} 22 0{k~ 12d ) when Xi and Xj are in the same 
subcube. It follows that the estimator has variance of order 
k/n 2 and bias of order 0(fc~ 2S7ti ). Variance and the squared 

bias are equated by solving k/n 2 = fc -4/37d which gives k = 
n 2i(i+Ap/d) 

49 Robins has been trying to find an answer to this question 
without success for a number of years. He suggested that it is 
now time for some crowd-sourcing. 

50 The estimator given above does not attain this rate when 
/3 > 1 because it fails to exploit the fact that &(•) is differen¬ 
tiable. In the interest of simplicity, we have posed this as a 
problem in variance estimation. However, Robins et al. (2008) 
show that the estimation of the variance is mathematically 
isomorphic to the estimation of 9 in the semi-parametric re¬ 
gression model E[Y \ A, X] = 9A + h(X), where A is a binary 
treatment. In the absence of confounding, 9 encodes the causal 
effect of the treatment. 
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ment as a nested sequence of individual RCTs tri¬ 
als run by nature. Their analysis is closely re¬ 
lated to g-estimation of SNM (discussed in Sec¬ 
tion 2.6). The critical difference is that in these 
papers Robins and Hernan do not specify a SNM 
to coherently link the trial-specific effect estimates. 
This has benefits in that it makes the analysis eas¬ 
ier and also more familiar to users without training 
in SNMs. The downside is that, in principle, this 
lack of coherence can result in different analysts rec¬ 
ommending, as optimal, contradictory interventions 
(Robins, Hernan and Rotnitzky 2007). 

Adjustment for “Reverse Causation” 

Consider an epidemiological study of a time- de¬ 
pendent treatment (say cigarette smoking) on time 
to a disease of interest, say clinical lung cancer. 
In this setting, uncontrolled confounding by unde¬ 
tected preclinical lung cancer (often referred to as 
“reverse causation”) is a serious problem. Robins 
(2008) develops analytic methods that may still pro¬ 
vide an unconfounded effect estimate, provided that 
(i) all subjects with preclinical disease severe enough 
to affect treatment (i.e., smoking behavior) at a 
given time t will have their disease clinically diag¬ 
nosed within the next x, say 2 years and (ii) based 
on subject matter knowledge an upper bound, for 
example, 3 years, on x is known. 

Causal Discovery 

Spirtes, Glymour and Schemes (1993) and Pearl 
and Verma (1991) proposed statistical methods that 
allowed one to draw causal conclusions from asso- 
ciational data. These methods assume an underly¬ 
ing causal DAG (or equivalently an FFRCISTG). If 
the DAG is incomplete, then such a model imposes 
conditional independence relations on the associ¬ 
ated joint distribution (via d-separation). Spirtes, 
Glymour and Schemes (1993) and Pearl and Verma 
(1991) made the additional assumption that all con¬ 
ditional independence relations that hold in the dis¬ 
tribution of the observables are implied by the un¬ 
derlying causal graph, an assumption termed “sta¬ 
bility” by Pearl and Verma (1991), and “faithful¬ 
ness” by Spirtes, Glymour and Scheines (1993). 
Under this assumption, the underlying DAG may 
be identified up to a (“Markov”) equivalence class. 
Spirtes, Glymour and Scheines (1993) proposed two 
algorithms that recover such a class, entitled “PC” 
and “FCI.” While the former presupposes that there 


are no unobserved common causes, the latter explic¬ 
itly allows for this possibility. 

Robins and Wasserman (1999) and Robins et al. 
(2003) pointed out that although these procedures 
were consistent they were not uniformly consistent. 
More recent papers (Kalisch and Biihlmann (2007); 
Colombo et al. (2012)) recover uniform consistency 
for these algorithms by imposing additional assump¬ 
tions. Spirtes and Zhang (2014) in this volume ex¬ 
tend this work by developing a variant of the PC Al¬ 
gorithm which is uniformly consistent under weaker 
assumptions. 

Shpitser et al. (2012, 2014), building on Tian and 
Pearl (2002b) and Robins (1999b) develop a theory 
of nested Markov models that relate the structure of 
a causal DAG to conditional independence relations 
that arise after re-weighting; see Section 3.4. This 
theory, in combination with the theory of graphical 
Markov models based on Acyclic Directed Mixed 
Graphs (Richardson and Spirtes (2002); Richard¬ 
son (2003); Wermuth (2011); Evans and Richardson 
(2014); Sadeghi and Lauritzen (2014)), will facilitate 
the construction of more powerful 51 causal discov¬ 
ery algorithms that could (potentially) reveal much 
more information regarding the structure of a DAG 
containing hidden variables than algorithms (such 
as FCI) that solely use conditional independence. 

Extrapolation and Transportability of Treatment 
Effects 

Quality longitudinal data is often only avail¬ 
able in high resource settings. An important ques¬ 
tion is when and how can such data be used to 
inform the choice of treatment strategy in low 
resource settings. To help answer this question, 
Robins, Orellana and Rotnitzky (2008) studied the 
extrapolation of optimal dynamic treatment strate¬ 
gies between two HIV infected patient populations. 
The authors considered the treatment strategies g x , 
of the same form as those defined in Section 2.10, 
namely, “start anti-retroviral therapy the first time 
at which the measured CD4 count falls below x. n 
Given a utility measure Y, their goal is to find the 
regime g Xopt that maximizes E[Y(g x )\ in the sec¬ 
ond low-resource population when good longitudi¬ 
nal data are available only in the first high-resource 
population. Due to differences in resources, the fre¬ 
quency of CD4 testing in the first population is much 


51 But still not uniformly consistent! 
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greater than in the second and, furthermore, for lo¬ 
gistical and/or financial reasons, the testing frequen¬ 
cies cannot be altered. In this setting, the authors 
derived conditions under which data from the first 
population is sufficient to identify g Xopt and con¬ 
struct IPTW estimators of g x t under those condi¬ 
tions. A key finding is that owing to the differential 
rates of testing, a necessary condition for identifica¬ 
tion is that CD4 testing has no direct causal effect 
on Y not through anti-retroviral therapy. In this is¬ 
sue, Pearl and Bareinboim (2014) study the related 
question of transportability between populations us¬ 
ing graphical tools. 

Interference, Interactions and Quantum 
Mechanics 

Within a counterfactual causal model, Cox (1958) 
defined there to be interference between treatments 
if the response of some subject depends not only 
on their treatment but on that of others as well. On 
the other hand, VanderWeele and Robins (2009) de¬ 
fined two binary treatments ( 01 , 02 ) to be causally 
interacting to cause a binary response Y if for 
some unit Y (1,1) ^ Y (1,0) = Y (0,1); VanderWeele 
(2010a) defined the interaction to be epistatic if 
y(l,l) / v(l,0) = y(0,l) = y(0,0). VanderWeele 
with his collaborators has developed a very gen¬ 
eral theory of empirical tests for causal interac¬ 
tion of different types (VanderWeele and Robins 
(2009); VanderWeele (2010a), 2010b; VanderWeele 
and Richardson (2012)). 

Robins, VanderWeele and Gill (2012) showed, per¬ 
haps surprisingly, that this theory could be used to 
give a simple but novel proof of an important re¬ 
sult in quantum mechanics known as Bell’s theo¬ 
rem. The proof was based on two insights: The first 
was that the consequent of Bell’s theorem could, 
by using the Neyman causal model, be recast as 
the statement that there is interference between a 
certain pair of treatments. The second was to recog¬ 
nize that empirical tests for causal interaction can 
be reinterpreted as tests for certain forms of interfer¬ 
ence between treatments, including the form needed 
to prove Bell’s theorem. VanderWeele et al. (2012) 
used this latter insight to show that existing em¬ 
pirical tests for causal interactions could be used to 
test for interference and spillover effects in vaccine 
trials and in many other settings in which inter¬ 
ference and spillover effects may be present. The 
papers Ogburn and VanderWeele (2014) and Van¬ 
derWeele, Tchetgen Tchetgen and Halloran (2014) 


in this issue contain further results on interference 
and spillover effects. 

Multiple Imputation 

Wang and Robins (1998) and Robins and Wang 
(2000) studied the statistical properties of the mul¬ 
tiple imputation approach to missing data (Rubin 
(1987)). They derived a variance estimator that is 
consistent for the asymptotic variance of a multi¬ 
ple imputation estimator even under misspecifica- 
tion and incompatibility of the imputation and the 
(complete data) analysis model. They also charac¬ 
terized the large sample bias of the variance estima¬ 
tor proposed by Rubin (1978b). 

Posterior Predictive Checks 

Robins, van der Vaart and Ventura (2000) stud¬ 
ied the asymptotic null distributions of the poste¬ 
rior predictive p-value of Rubin (1984) and Guttman 
(1967) and of the conditional predictive and partial 
posterior predictive p-values of Bayarri and Berger 
(2000). They found the latter two p-values to have 
an asymptotic uniform distribution; in contrast they 
found that the posterior predictive p-value could be 
very conservative, thereby diminishing its power to 
detect a misspecified model. In response, Robins et 
al. derived an adjusted version of the posterior pre¬ 
dictive p-value that was asymptotically uniform. 

Sensitivity Analysis 

Understanding that epidemiologists will almost 
never succeed in collecting data on all covariates 
needed to fully prevent confounding by unmeasured 
factors and/or nonignorable missing data, Robins 
with collaborators Daniel Scharfstein and Andrea 
Rotnitzky developed methods for conducting sensi¬ 
tivity analyses. See, for example, Scharfstein, Rot¬ 
nitzky and Robins (1999), Robins, Rotnitzky and 
Scharfstein (2000) and Robins (2002, pages 319- 
321). In this issue, Richardson et al. (2014) describe 
methods for sensitivity analysis and present several 
applied examples. 

Public Health Impact 

Finally, we have not discussed the large impact 
of the methods that Robins introduced on the sub¬ 
stantive analysis of longitudinal data in epidemiol¬ 
ogy and other fields. Many researchers have been 
involved in transforming Robins’ work on time- 
varying treatments into increasingly reliable, robust 
analytic tools and in applying these tools to help 
answer questions of public health importance. 
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LIST OF ACRONYMS USED 


CAR: Section 5 

coarsened at random. 

CD4: Section 2.2 

(medical) cell line depleted by 
HIV. 

CDE 

Section 3.1 

controlled direct effect. 

CMA 

Section 2.3 

causal Markov assumption. 

DAG 

Section 2.3 

directed acyclic graph. 

DR 

Section 5 

doubly robust. 

dSWIG 

Section 2.4 

dynamic single-world interven¬ 
tion graph. 

FFRCISTG 

Section 2.2 

finest fully randomized causally 
interpreted structured tree 

graph. 

HIV: Section 2.2 

(medical) human immunodefi¬ 
ciency virus. 

IPCW 

Section 2.9 

inverse probability of censoring 
weighted. 

IPTW 

Section 2.10 

inverse probability of treatment 
weighted. 

ITT 

Section 2.9 

intention to treat. 

MI 

Section 3.3 

(medical) myocardial infarction. 

MSM 

Section 2.10 

marginal structural model. 

NPSEM 

Section 2.2 

nonparametric structural equa¬ 
tion model. 

NPSEM-IE 

Section 2.2 

nonparametric structural equa¬ 
tion model with independent er¬ 
rors. 

PDE 

Section 3.3 

pure direct effects. 

PSDE 

Section 3.2 

principal stratum direct effects. 

RCT 

Section 2.2 

randomized clinical trial. 

SNM 

Section 2.6 

structural nested model. 

SNDM 

Section 2.6 

structural nested distribution 
model. 

SNFTM 

Section 2.6 

structural nested failure time 
model. 

SNMM 

Section 2.6 

structural nested mean model. 

SWIG 

Section 2.3 

single-world intervention graph. 

TIE 

Section 3.3 

total indirect effect. 
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